Paper Reproduction Notebook

1. Environment Setting

2. Load CRC Dataset

Finished QC in R:

For other collected scRNA-seq datasets, we applied the same filtering steps to 10X Genomics datasets (CRC, KIDNEY, STAD, CRC, HCC, NPC and PAAD).

2.1 Convert to Scanpy Object

2.2 Normalization

After quality control, we applied the library-size correction method to normalize the raw count by using normalize_total function in Scanpy. Then the logarithmized normalized count matrix was used for the downstream analysis.

Actually, there are two main approaches to normalize single cell data:

  1. One is a simple linear scaling to adjust counts such that each cell has about the same total library size. Examples include converting to counts per million (CPM) which does a reasonable job of correcting for differences in library size.

  2. Another methods are more complex, and generally involve parametric modeling of count data to perform nonlinear normalization. These methods are useful when there are more complex sources of unwanted variation (e.g., for highly heterogeneous populations of cells with different sizes).

We usually just stick to the simple, but still need extract attention for some spectial situations. Here is a review Cole,Michael B.et al., Cell Systems, 201930080-8?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS2405471219300808%3Fshowall%3Dtrue#secsectitle0020) for the comparison of normalization methods.

3. Create one merged object

3.1 Create Scanpy Object for newly generated datasets

3.2 Merge and Examine Data

Then, effects of the total count per cell, the percentage of mitochondrial gene count and the percentage of count for heat shock protein associated genes (HSP) were regressed out by using scanpy.pp.regress_out function.

This is achieved by doing a generalized linear regression using these parameters as covariates in the model. Then the residuals of the model are taken as the "regressed data".

4. Compute HVG and PCA&UMAP

4.1.1 Compute HVG

We need to find genes that are highly variable across cells, which in turn will also provide a good separation of the cell clusters.

In brief, 2,000 highly-variable genes were selected for downstream analysis by using scanpy.pp.highly_variable_genes function with parameter “n_top_genes=2000.”

4.1.2 PCA & UMAP

A principal component analysis (PCA) matrix with 100 components were calculated to reveal the main axes of variation and denoise the data by using scanpy.tl.pca function with parameter “svd_solver='arpack', n_comps=100. For visualization, the dimensionality of each dataset was further reduced using Uniform Manifold Approximation and Projection (UMAP) implemented in scanpy.tl.umap function with default parameters.

4. Batches Removal

4.1 Remove Batches from Donors

To remove the batch effects from different donors, we applied bbknn algorithm with parameter “batch_key='patient', n_pcs=100” to obtain a batch-corrected space.

4.2 Remove Batches from Platforms

We run two rounds of Scanorama (Hie et al., 2019), an algorithm that could identify and merge shared cell types among multiple datasets, to remove the batch effects within scRNA-seq datasets of 15 cancer types. First, we applied Scanorama to datasets generated from 3′ library and 5′ library from 10x Genomics to remove the batch effects attribute to these two protocols. Then, a second-round of Scanorama was applied to remove the batch effects resulting from the diverse platforms, including 10x Genomics, MARS-Seq and inDrop.

Create individual AnnData objects from each of the datasets.

4.2.1 ROUND 1: 3′ library and 5′ library

Detect HVG

Compare overlap of the variable genes.

Select all genes that are variable in at least 1 datasets and use for remaining analysis.

run Scanorama

4.2.2 ROUND 2: 10X and Smart Seq-2

Detect HVG

Compare overlap of the variable genes.

Select all genes that are variable in at least 1 datasets and use for remaining analysis.

run Scanorama

5. Supplemental Information